17.2 DNA Methylation Profiling

Epigenetic information is lost during standard Sanger sequencing or NGS because the enzymes involved in PCR treat methylated cytosine as ordinary cytosine. Although the overall proportion of methylated DNA can be determined chemically, in order to properly understand the regulatory rôle of methylation, it is necessary to determine the methylation status of each base in sequence (bearing in mind that methylation occurs predominantly at CpG dinucleotides). The methylation status of a nucleotide can be determined by pyrosequencing (Sect. 17.1.3), but that technique is limited to relatively short nucleotide sequences. A more recent method relies on treating DNA with bisulfite (under acidic conditions unmethylated cytosine is converted to uracil, whereas methylated cytosine is not) and comparing the resulting sequence with that of the untreated DNA.¹² Even newer is the technique called MethylCap-seq.¹³ The DNA is sonicated into fragments about 300 base pairs long, which are then exposed to MBD-GST (a methyl-CpG-binding domain fused to glutathione S-transferase) immobilized on magnetic beads. The immobilized protein captures methylated fragments at low concentrations of NaCl; a gradient of increasing salt concentration then elutes the DNA fragments from the beads. Epigenetic profiling is of growing importance to medicine.¹⁴
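The bisulfite comparison described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not a description of any published pipeline: the function names `bisulfite_convert` and `call_methylation` are hypothetical, only one strand is considered, conversion is assumed to be complete, and the treated read is assumed to be already aligned base-for-base with the untreated sequence. Uracil is written as T, since that is how it is read after PCR amplification.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment of one strand (toy model).

    Unmethylated C becomes U, which is read as T after PCR;
    methylated C is protected and stays C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(untreated, treated):
    """Compare untreated and bisulfite-treated sequences position by position.

    Returns {position: True/False} for every C in the untreated sequence:
    True if the C survived treatment (methylated), False if it reads as T.
    """
    if len(untreated) != len(treated):
        raise ValueError("sequences must be aligned and of equal length")
    calls = {}
    for i, (ref, conv) in enumerate(zip(untreated, treated)):
        if ref != "C":
            continue  # only cytosines carry a methylation call
        if conv == "C":
            calls[i] = True
        elif conv == "T":
            calls[i] = False
        # any other base is a mismatch and is left uncalled
    return calls


# Hypothetical example: cytosines at positions 1, 4, 5, 9;
# those at positions 1 and 9 (both in CpG context) are methylated.
untreated = "ACGTCCGATCGA"
treated = bisulfite_convert(untreated, methylated_positions=[1, 9])
print(treated)                               # ACGTTTGATCGA
print(call_methylation(untreated, treated))  # {1: True, 4: False, 5: False, 9: True}
```

In practice the treated reads must first be aligned to a reference, incomplete conversion must be accounted for, and both strands have to be handled; dedicated bisulfite-aware aligners exist for that purpose.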

17.3 Gene Identification

Gene identification (or “gene finding”) is the process of identifying regions in the genome that are likely to correspond to genes, using a combination of computational algorithms, statistical analysis, and other bioinformatics tools. Other features, such as regulatory elements and splice sites, may assist the finding process. The ultimate goal of gene identification (or “gene prediction”) is automatic annotation: to identify all biochemically active portions of the genome by algorithmically processing the sequence and to predict the reactions and reaction products of those portions coding for proteins. At present we are still some way from this goal. Success will not only allow one to discover the functions of natural genes but should also enable the biochemistry of new, artificial sequences to be predicted and, ultimately, the sequence necessary to accomplish a given function to be prescribed.
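As a deliberately simple illustration of what "algorithmically processing the sequence" can mean, the following Python sketch scans the three forward reading frames for open reading frames (ORFs), i.e., runs of codons from an ATG to the first in-frame stop codon. This is only one elementary signal among those that practical gene finders combine; the function name `find_orfs` and the threshold `min_codons` are hypothetical choices made for this example, and the reverse strand, introns, and alternative start codons are all ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ORFs on the forward strand.

    start is the index of the A of ATG, end is exclusive and includes
    the stop codon. min_codons counts the coding codons (ATG included,
    stop excluded). Only the first ATG before each stop is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # step through complete codons in this reading frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence containing ATG AAA TTT GGG TAA in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
```

A scan of this kind is a reasonable first pass for compact prokaryotic genomes; it is not sufficient where genes are interrupted by introns.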

In eukaryotes, the complicated exon–intron structure of the genome makes it particularly difficult to predict the course of the key operations of transcription, splicing, and translation from a sequence alone (even setting aside the possibility that essential instructions encoded in acylation of histones, etc. are transmitted epigenetically from generation to generation).
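One way to appreciate the difficulty is to count how many candidate introns the sequence alone permits. Most introns begin with the dinucleotide GT and end with AG, but these short motifs occur frequently by chance. The Python sketch below enumerates every GT...AG span in a sequence; it is an illustration under simplifying assumptions, not a splice-site predictor. The function name `candidate_introns` and the toy `min_len` threshold are hypothetical, and real introns are far longer than the ones in this example.

```python
def candidate_introns(seq, min_len=4):
    """List every (start, end) span that begins with GT and ends with AG.

    start is the index of the G of GT; end is exclusive, just past the
    G of AG. Spans shorter than min_len are discarded.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


# Hypothetical 14-base sequence: already three competing candidates.
seq = "AAGTCCAGTTAGAA"
for start, end in candidate_introns(seq):
    print(start, end, seq[start:end])
# 2 8 GTCCAG
# 2 12 GTCCAGTTAG
# 7 12 GTTAG
```

Because the number of donor and acceptor pairings grows roughly with the product of the two motif counts, additional evidence (statistical models of splice-site context, transcript data, or homology) is needed to choose among them.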

Challenges remain in identifying the exons, introns, promoters, and so on in each stretch of DNA, such that the exons could be grouped into genes and the promoters

12 Bibikova et al. (2006); Bibikova and Fan (2010).

13 Brinkman et al. (2010); for other methods, see Zuo et al. (2009).

14 See, e.g., Heyn and Esteller (2012).